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Abstract 

We take a new look at parameter estimation for Gaussian Mixture Models (GMMs). In particu¬ 
lar, we propose using Riemannian manifold optimization as a powerful counterpart to Expectation 
Maximization (EM). An out-of-the-box invocation of manifold optimization, however, fails spec¬ 
tacularly: it converges to the same solution but vastly slower. Driven by intuition from manifold 
convexity, we then propose a reparamerization that has remarkable empirical consequences. It 
makes manifold optimization not only match EM—a highly encouraging result in itself given the 
poor record nonlinear programming methods have had against EM so far—but also outperform 
EM in many practical settings, while displaying much less variability in running times. We further 
highlight the strengths of manifold optimization by developing a somewhat tuned manifold LBFGS 
method that proves even more competitive and reliable than existing manifold optimization tools. 
We hope that our results encourage a wider consideration of manifold optimization for parameter 
estimation problems. 


1 Introduction 

Gaussian Mixture Models (GMMs) are widely used in a variety of areas, including machine learning 
and signal processing [5, 11, 15, 18, 20]. A quick search of the literature suggests that for estimating 
parameters of a GMM the Expectation Maximization (EM) algorithm [10] is a de facto choice. Although 
other numerical approaches have also been considered [23], methods such as conjugate gradients, quasi- 
Newton, Newton, are typically inferior to EM [33] in many practical settings. 

The main difficulty of applying standard nonlinear programming techniques for GMMs is optimiza¬ 
tion over covariance matrices. The positive definiteness constraint, although an open subset of Euclidean 
space, can be difficult to handle, especially for higher-dimensional problems. When approaching the 
boundary of the constraint set, convergence speed of iterative methods can also get adversely affected. 
A partial remedy for these difficulties is to use the Cholesky decomposition, as was also exploited for 
semidefinite programming in [8] . But as pointed out in [29] , for general optimization problems (even for 
semidefinite programs) such a nonconvex decomposition adds many more stationary points and possi¬ 
bly spurious local minima. One can formulate the positive definiteness constraint via a set of smooth 
convex inequalities [29] and resort to interior-point methods. It was observed in [26] that using such 
sophisticated methods can be extremely slower (on a class of statistical problems) than simpler EM-like 
fixed point iterations, especially for higher dimensions. 

In this paper we reconsider the above viewpoint and take a new look at nonlinear optimization 
techniques for GMM parameter estimation, which can not only match EM but often also outdo it. We 
believe that matching EM’s performance on nontrivial GMMs using such numerical methods is already 
remarkable. Even more interesting are instances where we substantially outperform EM. 

Specifically, we approach GMM parameter estimation via Riemannian Manifold Optimization. We 
turn to manifold optimization motivated by a simple observation: the positive definiteness constraint on 
covariance matrices poses difficulties to all numerical methods (gradient-descent, conjugate gradients, 
quasi-Newton, etc.); and one way to ameliorate these difficulties is by operating directly on the manifold 
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of positive definite matrices^. Therewith, one implicitly satisfies the constraints, and can devote greater 
effort to the maximization of the log-likelihood. 

A reader familiar with the simplicity and elegance of EM may question the above motivation. And 
this skepticism is justified: an out-of-the-box invocation of manifold optimization turns out to be vastly 
inferior to EM. So, should we discard manifold optimization too? No. But we do need to develop a 
more refined approach; we outline our ideas below. 

Intuitively, the mismatch lies in the geometry. Recall that for GMMs, the M-step of EM is a 
Euclidean convex optimization problem (which even has a closed form solution), whereas the log- 
likelihood is not manifold convex^ even for a single Gaussian. This suggests that it may be fruitful to 
consider a reparametrization which makes at least the single component log-likelihood manifold convex. 
This intuition turns out to have remarkable empirical consequences (Fig. 1), which ultimately enables 
manifold optimization to compete with EM and often even surpass it. 

Contributions. In light of the above background, the main contributions of this paper are as 
follows: 

- Introduction of manifold optimization as a powerful numerical tool for GMM parameter estimation. 
Most importantly, we show how a simple reparamerization holds the key to making manifold 
optimization succeed. 

- Development of a solver based on manifold-LBFGS; our key contribution here is the design and 
implementation of a powerful line-search procedure. This line-search helps ensure convergence, 
and beyond that, it helps LBFGS outperform both EM and the usual manifold conjugate gradient 
(GG) method; our solver may thus also be of independent interest. 

- Experimental evidence on both synthetic and real-data to show a performance comparison between 
manifold optimization and EM. 

As may be gleaned from our results, manifold optimization performs well across a wide range of pa¬ 
rameter values and problem sizes, while being much less sensitive to overlapping data than EM, and 
displaying less variability in running times. These results are encouraging and suggest that manifold 
optimization could open a new algorithmic avenues for handling mixture models. 

We would like to note that for ensuring reproducibility of our results and as a service to the commu¬ 
nity, we will release our Matlab implementation of the methods developed in this paper. The manifold 
CG method that we use is directly based on the excellent toolkit ManOpt [7]. 

Related work. The published work on EM is huge, so a summary is impossible. Instead, let us briefly 
mention a few lines of related work. Xu and Jordan [33] examine several aspects of EM for GMMs and 
counter the claims of Redner and Walker [23], who thought EM to be inferior to general purpose 
nonlinear programming techniques, especially second-order methods. However, it is well-known, see 
e.g., [23, 33], that EM can attain good likelihood values rapidly, and it scales to much larger problems 
than amenable to second-order methods. Local convergence analysis of EM is available in [33], with 
more refined and precise results in [17], who formally show that when data have low overlap, EM 
can converge locally superlinearly. Our paper develops manifold LBFGS, which being a quasi-Newton 
method can also display local superlinear convergence. 

For GMMs some innovative gradient-based methods have also been suggested [21, 25]. In order to 
satisfy positive definite constraint, the authors suggest to use Cholesky decomposition of covariance 
matrices. Such a reparametrization makes the objective function of even a single Gaussian nonconvex, 
and adds spurious stationary points to the objective function. Also, these works report results only for 
low-dimensional problems and spherical (near spherical) covariance matrices. 

^Equivalently, on the interior of the constraint set, as is done by interior point methods (their nonconvex versions); 
though these turn out to be slow too as they are second order methods. 

^That is, convex along geodesic curves on a manifold. 
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The idea of manifold optimization is new for GMM, but in itself it is a well-developed branch of 
nonlinear optimization. A classic reference is [28]; a more recent work is [1]; and even a Matlab toolbox 
exists now [7]. In machine learning, manifold optimization has witnessed increasing interest^, e.g., for 
low-rank optimization [14, 30], or optimization based on geodesic convexity [26, 32]. 

Beyond numerics, there is substantial interest in theoretical analysis of mixture models [3, 9, 12, 19]. 
These studies are of great theoretical value (though sometimes limited to either low-dimensional, or 
small number of mixture components, or spherical Gaussians, etc.), but are orthogonal to our work 
which focuses on highly practical algorithms for general GMMs. 

2 Background and problem setup 

We begin with some background material, which also serves to establish notation. The key quantity in 
this paper is the Gaussian Mixture Model (GMM) for vectors x G 

P(x) ■= 2^ - , Mi, Sj), 

where pj\/ is a (multivariate) Gaussian density with mean G and covariance S >- 0, i.e., 

PAr(x; ( 1 , S) := det(S)“^/^(27r)“''/^ exp(-i(a; - - fj,)). 

Given i.i.d. samples {xi, ... ,Xn}, we seek to estimate {fij G 0}jLi and a G Ak, the K- 

dimensional probability simplex, via maximum likelihood estimation. This task requires solving the 
GMM optimization problem: 


max 




K 

i=i 


, Mi, 



( 2 . 1 ) 


Problem (2.1) in general can require exponential time [19].^ However, our focus is more pragmatic: 
similar to EM, we also seek to efficiently compute local solutions. Our methods are set in the framework 
of manifold optimization [1, 28]; so let us now recall some material on manifolds. 


2.1 Manifolds and geodesic convexity 

A smooth manifold is a non-Euclidean space that locally resembles Euclidean space [16]. For opti¬ 
mization, it is more convenient to consider Riemannian manifolds (smooth manifolds equipped with an 
inner product on the tangent space at each point). These manifolds possess structure that allows one 
to extend the usual nonlinear optimization algorithms [1, 28] to them. 

Algorithms on manifolds often rely geodesics, i.e., curves that (locally) join points along shortest 
paths. Geodesics help generalize Euclidean convexity to geodesic convexity. In particular, say A1 is a 
Riemmanian manifold, and x,y G A4', also let 

'Yxy ■ [0,1] y Ad, 7x^(0) = X, 7xj/(l) = y, 

be a geodesic joining x to y. Then, a set A C A1 is geodesically convex if for all x,y G A there is 
a geodesic contained within A. Further, a function / : A —K is geodesically convex if for all 
x,y G A, the composition / o : [0,1] —K is convex in the usual sense. 

The manifold of interest to us in this paper is P^, the manifold of d x d symmetric positive definite 
matrices. At any point S S P^*, the tangent space is isomorphic to entire set of symmetric matrices; 

^Manifold optimization should not be confused with “manifold learning” a separate problem altogether. 

^Though recent work shows that under strong assumptions, it has polynomial smoothed complexity [12]. 
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and the Riemannian metric at S is given by tr(S This metric induces a geodesic from 

Si to S 2 that happens to even have a closed-form, specifically [4], 

W := S}/'(S7'/2S2S7l/')‘S}/^ 0 < t < 1. 

Thus, a function / : —>■ K if geodesically convex on P^^ if it satisfies 

/(7Si,i:.(i)) < (1 - 0/(Si) + t/(S2), t G [0,1], Si, S 2 G 71. 

Such functions can be nonconvex in the Euclidean sense, but remain globally optimizable due to geodesic 
convexity. This property has been important in some matrix theoretic applications [4, 27], and has 
gained more extensive coverage in several recent works [24, 26, 32]. 

We emphasize that even though the mixture cost (2.1) is not geodesically convex, for GMM opti¬ 
mization geodesic convexity seems to play a crucial role, and it has a huge impact on convergence speed. 
This behavior is partially expected and analogous to EM, where a convex M-Step makes the overall 
method much more practical. The next section uses this intuition to elicit geodesic convexity. 

2.2 Problem reformulation 

We begin with parameter estimation for a single Gaussian: although this has a closed-form solution 
(which ultimately benefits EM), it requires more subtle handling when applying manifold optimization. 
Gonsider therefore, maximum likelihood parameter estimation for a single Gaussian: 

E TI 

. logPAt(®i;M,S). (2.2) 

2 = 1 

Although (2.2) is convex in the Euclidean sense, it is not geodesically convex on its domain x P^^, 
which makes it geometrically not so well-suited to the positive definite matrix manifold. 

To fix this mismatch and turn (2.2) into a geodesically convex problem, we invoke a simple reparamer- 
ization® that has far-reaching impact. We augment the sample vectors Xi by an extra dimension and 
consider yj = [xj Ij; therewith, we transform (2.2) into the problem 

max £(5) := V loggAt(y2;5'), (2.3) 

Sxo —'2=1 

where we define qj\/{yi\ S) := 2 tt exp{^)pj^(yi; S). Prop. 1 proves the key property of (2.3). 
Proposition 1. Let 4>{S) = —C{S), where C{S) is as in (2.3). Then, (j) is geodesieally convex. 

We omit the proof for space reasons; it may be found in the appendix. 

Theorem 2.1 shows that solving the reformulation (2.3) also solves the original problem (2.2). 
Theorem 2.1. //p*,S* maximize (2.2), and if S* maximizes (2.3), then C{S*) = C{ijC ,H*) for 

* - ( M-'- 1 j ■ 

Proof. We decompose S via Schur complements into the components (using Matlab notation): 

U = ->S'{l:d,d-|-l}S'{d+l,l:d}) f = S{l:d,d+l}j S = d-|-1.d-|-1} ■ 

*d-|-l,d-|-l 

®This reparamerization in itself is probably folklore; its role in GMM optimization is what is crucial here. 
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Time (seconds) 
(a) Single Gaussian 



(b) Mixtures of seven Gaussians 


Figure 1: The effect of reparametrization in convergence speed of manifold GG and manifold LBFGS methods 
{d = 35); note that the x-axis (time) is on a logarithmic scale. 


The objective function C{S) in terms of these parameters becomes 

C{U,t, s) = const + f s — f det(Lf) — — tY'U~^{xi — + 

Optimizing C over s > 0 we see that s* = 1 must hold; so we can eliminate s. Hence, the objective 
reduces to a d-dimensional Gaussian log-likelihood, for which clearly U* = S* and t* = fi*. □ 


Theorem 2.1 shows that the reparameterization is “faithful” as it leaves the optimum unchanged. 
Figure 1 shows the true import of this reparametrization: its dramatic impact on the empirical behavior 
Riemmanian Conjugate-Gradient (CG) and Riemannian LBFGS is unmistakable. 

Theorem 2.2. A local maximum of the reparameterized GMM log-likelihood 

is a local minimum of the original log-likelihood 

} jLi) : = ^ .^^ log aJp^r{x,\ fij , Sj) y 

Theorem 2.2 shows that we can replace (2.1) by a reparameterized log-likelihood whose local maxima 
agree with those of (2.1). Moreover, the individual components of the reparameterized log-likelihood 
are geodesically convex, which once again has a huge empirical impact (see Figure 1). 

We also need to replace the constraint a S Ak to make the problem unconstrained. We do this via 
a commonly used change of variables [13]: 

7?fc = log^^^, k = l,...,K -1. 

Assume rjK = 0 to be a constant, then the final optimization problem is given by: 


max 




n 

:= ^log( 


K 


exp(r7j 


^ Ef=iexp(ryfc) 


Z=1 


qNivt'.Sj) 


(2.4) 


We view (2.4) as a manifold optimization problem; specifically, it is an optimization problem on the 
product manifold (n^i X Let us see how to solve it. 
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3 Manifold Optimization 

A common approach for unconstrained optimization on Euclidean spaces is to iteratively apply the 
following two steps: (i) find a descent direction; and (ii) perform a line-search to obtain sufficient 
decrease (to ensure convergence). 

The difference when optimizing on manifolds is that the descent direction is computed on the 
tangent space. At a point X on the manifold, the tangent space Tx is the approximating vector space 
(see Fig. 2). Given a descent direction £ Tx, line-search is performed along a smooth curve on the 
manifold (red curve in Fig. 2). The derivative of this curve at point X equals the descent direction ^x- 
We refer the reader to [1, 28] for an in depth introduction to manifold optimization. 

Successful large-scale (Euclidean) optimization methods such as conjugate-gradient and LBFGS, 
combine gradients at the current point with gradients and descent directions from previous points to 
generate a descent direction at the current point. To adapt such algorithms to manifolds, in addition 
to defining gradients on manifolds, we also need to define how to transport vectors in a tangent space 
at one point, to vectors in a different tangent space at another point. 

On Riemannian manifolds, the gradient is simply defined as a direction on the tangent space, where 
the inner-product of the gradient and another direction in the tangent space gives the directional 
derivative of the function. Formally, if gx defines the inner product in the tangent space Tx, then 

Df{X)^ = gxigradf{X),^), Ioi^gTx- 

Given a descent direction in the tangent space, the curve along which we do the line-search can be 
a geodesic. A map that takes the direction and a step length, and yields a corresponding point on 
the geodesic is called an exponential map. A Riemannian manifold also comes with a natural way of 
transporting vectors on geodesics, which is called parallel transport. Intuitively, a parallel transport is 
a differential map with zero derivative along the geodesics. Algorithm 1 sketches a generic manifold 
optimization algorithm. 


Algorithm 1: Sketch of optimization algorithms (CG, LBFGS) on manifold 

Given: Riemannian manifold Ai with Riemannian metric g; parallel transport T on At; exponential map 
R-, initial value Xq; a smooth function / 

for A: = 0,1,... do 

Obtain a descent direction based on stored information and grad/(Afe) using defined g and T 
Use line-search to find a such that it satisfies appropraite conditions 
Calculate Xk+i = Rxfc(a^fe) 

Based on the memory and need of algorithm store Xk, grad/(Afe) and a^k 

end for 
return Xk 


Table 1 summarizes the key quantities for the positive definite matrix manifold. Note that a product 
space of Riemannian manifolds is again a Riemannian manifold with the exponential map, gradient and 
parallel transport defined as the Gartesian product of individual expressions; the inner product is defined 
as the sum of inner product of the components in their respective manifolds. 

Different variants of LBFGS can be defined depending where to perform vector transport. We found 
that the version developed in [27] gives the best performance. We implemented this algorithm together 
with the crucial line-search algorithm satisfying Wolfe conditions, which we now explain. 

3.1 Line-search algorithm satisfying Wolfe conditions 
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Definition 

Expression for PSD matrices 

Tangent space 

Metric between two tangent vectors C >7 at E 
Gradient at E if Euclidean gradient is V/(E) 
Exponential map at point E in direction ^ 
Parallel transport of tangent vector ^ from Ei to E 2 

Space of symmetric matrices 
<;5:(Cb)=tr(E-ieS-\) 
grad/(E) = iE(V/(X) + V/(X)^)E 
RE(0 = Sexp(E-iO 

rEi,E2(0 = e = (E2E7i)i/2 


Table 1: Summary of Riemannian expressions for PSD matrices 


To ensure LBFGS on the manifold always produces a descent 
direction, it is necessary to ensure that the line-search algorithm 
satisfy Wolfe conditions [24]. These conditions are given by: 

/(i?x,Kfe)) < f{Xk)+ciaDf{Xk)^k (3.1) 

Df{Xk+i)^k+i > C2Df{Xk)ik, (3.2) 

where 0 < ci < C 2 < 1. Note that aDf{Xk)^k = 
(gi'ad/(Xfc), a^fc), i.e., the derivative of f{Xk) in the direc¬ 
tion a^k equals the inner product of descent direction and gradi¬ 
ent of the function. Practical line-search algorithms implement 
a stronger version of (3.2), leading to the so-called strong Wolfe 
condition: 

\Df{Xk+i)ik+i\<C2Df{Xk)£.k- 

Similar to the line-search algorithm in Euclidean case, the 
line-search algorithm is divided into two phases: bracketing and zooming [22]. During bracketing, 
an interval is found such that a point satisfying Wolfe conditions can be found in this interval. In 
the zooming phase, the actual point in the interval satisfying the conditions is obtained. The one¬ 
dimensional function and its gradient that the line-search uses are defined as 0 (q;) = /{Rx^i^^k)) and 
(jf {oi) = aDf{Xk)^k, respectively. The algorithm is the same as the line-search in the Euclidean space, 
but we present details for its manifold incarnation in the appendix for the reader’s convenience. Theory 
behind how this algorithm is guaranteed to find a step-length satisfying (strong) Wolfe conditions can 
be found in [22]. 

The initial step-length ai can be guessed using the previous function and gradient information. We 
propose the following choice that turns out to be quite effective: 



Figure 2: Visualization of line-search on 
a manifold: X is a point on the manifold, 
Tx is the tangent space at the point X, 
is a descent direction at X; the red curve 
is the curve along which line-search is per¬ 
formed. 


J{Xk)-f{Xk-i) 

" Df{Xk)^k 


(3.3) 


Equation (3.3) is obtained by finding a* that minimizes a quadratic approximation of the function along 
the geodesic through the previous point (based on f{Xk-i), f{Xk) and Df{Xk-i)^k-i)- 


. _ J{Xk)-f{Xk-i) 

Df{Xk-l)Ck-l 


(3.4) 


Then assuming that hrst-order change will be the same as in the previous step, we write 


a*Df{Xk-i)^k-i « aiDf{Xk)ik. (3.5) 

Combining (3.4) and (3.5), we obtain our procedure of selection ai expressed in (3.3). Nocedal and 
Wright [22] suggest using either a* of (3.4) for the initial step-length ai, or using (3.5) where a* is set 
to be the step-length obtained in the line-search in the previous point. We observed the choice (3.3) 
proposed above, leads to substantially better performance than the other two approaches. 
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4 EXPERIMENTAL RESULTS 



EM Algorithm 
Time (s) ALL 

LBFGS Reparametrized 
Time (s) ALL 

CG Reparametrized 
Time (s) ALL 

c = 0.2 

K = 

2 

1.0 ± 0.5 

-11.3 

5.6 ± 3.2 

-11.3 

3.6 ± 1.9 

-11.5 


K = 

5 

35.4 ± 53.1 

-12.8 

50.0 ± 32.1 

-12.8 

47.1 ± 41.9 

-12.9 

c = 1 

K = 

2 

0.5 ± 0.2 

-10.8 

3.1 ± 1.0 

-10.8 

2.6 ± 0.7 

-10.8 


K = 

5 

103.6 ± 114.6 

-13.5 

72.9 ± 62.6 

-13.4 

42.4 ± 27.9 

-13.3 

c = 5 

K = 

2 

0.2 ± 0.2 

-11.2 

2.9 ± 1.4 

-11.2 

2.3 ± 0.9 

-11.2 


K = 

5 

36.1 ± 70.9 

-12.8 

27.7 ± 32.5 

-12.8 

30.4 ± 42.2 

-12.8 


Table 2: Speed and average log-likelihood (ALL) comparisons for d = 20, e = 10 (each row reports results 
averaged over 20 runs over different datasets, so the ALL values are not comparable to each other). 


4 Experimental Results 

We have performed numerous experiments to examine the effectiveness of the presented method. We 
report performance comparisons on both real and simulated data. In all experiments, we initialize 
the mixture parameters using k-means-|-+ [2], and we start all methods using the same initialization. 
All methods also use the same termination criteria: they stop either when the difference of average 
log-likelihood falls below 10“®, or when the number of iterations exceed 1500. Many more results for 
both simulated data and real data can be found in the appendix. 

Simulated Data 

EM’s performance is well-known to depend on the degree of separation of the mixture components [17, 
33]. To assess the impact of this separation on our methods, we generate data as proposed in [9, 31]. 
The distributions are sampled so their means satisfy the following inequality: 

\/i^j ■■ \\mi - rrijW > cmax{tr(Si) - tr(Sj)}, 

where c models the degree of separation. Since mixtures with high eccentricity have smaller overlap, 
in addition to high eccentricity e = 10 (eccentricity is defined as the ratio of the largest eigenvalue to 
the smallest eigenvalue of the covariance matrix), we also test the (spherical) case where components 
do not have any eccentricity, so e = 1. We test three levels of separation c = 0.2 (low), c = 1 (medium) 
and c = 5 (high). We test two different numbers of mixture components K = 2 and AT = 5; we consider 
experiments with larger values of K for our real data experiments. 

For e = 1, the results for data with dimensionality equal to 20 are given in Table 2. The results are 
obtained after running with 20 different random choices of parameters for each configuration. From the 
tables it is apparent that the performance of EM and Riemannian optimization with our reparametriza- 
tion are very similar. The variance of computation time shown by Riemmanian optimization is, however, 
notably smaller. 

In another set of simulated data experiments, we apply different algorithms for the case where there 
is no eccentricity; the results are shown in Table 3. The interesting case is the case of low separation 
c = 0.2, where the condition number of the Hessian becomes large. As predicted by theory, the EM 
converges very slowly in such a case; Table 3 confirms this claim. It is known that in this such a case, 
the performance of powerful optimization approaches like CG and LBFGS also degrades [22]. But both 
GG and LBFGS suffer less than EM, and LBFGS performs noticeably better than GG. 

Real Data 

We now present performance evaluation on natural image datasets, where mixtures of Gaussians were 
reported to be a good fit to the data [34]. We extracted 200,000 image patches of size 6x6 from images 
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EM Algorithm 
Time (s) ALL 

LBFGS Reparametrized 
Time (s) ALL 

CG Reparametrized 
Time (s) ALL 

o 

II 

u 

K = 

2 

72.9 ± 37.7 

17.6 

40.6 ± 21.6 

17.6 

49.4 ± 31.7 

17.6 


K = 

5 

396.7 ± 136.6 

17.5 

156.1 ± 80.2 

17.5 

216.3 ± 51.4 

17.5 

c = 1 

K = 

2 

7.0 ± 8.4 

17.1 

13.9 ± 13.7 

17.0 

16.7 ± 18.7 

17.0 


K = 

5 

38.6 ± 67.0 

16.2 

43.8 ± 38.5 

16.2 

58.4 ± 47.4 

16.2 

c = 5 

K = 

2 

0.2 ± 0.1 

17.1 

3.0 ± 0.5 

17.1 

2.7 ± 0.8 

17.1 


K = 

5 

26.4 ± 55.3 

16.1 

20.2 ± 18.4 

16.1 

23.3 ± 27.8 

16.1 


Table 3: Speed and ALL comparisons for d — 20, e = 1. 


EM Algorithm 
Time (s) ALL 

LBFGS Reparametrized 
Time (s) ALL 

CG Reparametrized 
Time (s) ALL 

CG Usual 

Time (s) ALL 

K = 

2 

16.61 

29.28 

14.23 

29.28 

17.52 

29.28 

947.35 

29.28 

K = 

3 

90.54 

30.95 

38.29 

30.95 

54.37 

30.95 

3051.89 

30.95 

K = 

4 

165.77 

31.65 

106.53 

31.65 

153.94 

31.65 

6380.01 

31.64 

K = 

5 

202.36 

32.07 

117.14 

32.07 

140.21 

32.07 

5262.27 

32.07 

K = 

6 

228.80 

32.36 

245.74 

32.35 

281.32 

32.35 

10566.76 

32.33 

K = 

7 

365.28 

32.63 

192.44 

32.63 

318.95 

32.63 

10844.52 

32.63 

K = 

8 

596.01 

32.81 

332.85 

32.81 

536.94 

32.81 

14282.80 

32.58 

K = 

9 

900.88 

32.94 

657.24 

32.94 

1449.52 

32.95 

15774.88 

32.77 

K = 

10 

2159.47 

33.05 

658.34 

33.06 

1048.00 

33.06 

17711.87 

33.03 


Table 4: Speed and ALL comparisons for natural image data d — 35. 


and subtracted the DC component, leaving us with 35-dimensional vectors. Performance of different 
algorithms are reported in Table 4. As for simulated results, performance of EM and manifold CG on 
the reparametrized parameter space is similar. Manifold LBFGS converges notably faster (except for 
K = 6) than both EM and CG. Without our reparamerization, performance of the manifold methods 
degrades substantially; because the experiments take too long to run, we report only the degraded 
behavior of CG, which runs about 20 times slower than reparametrized GG and LBFGS. Note that for 
N = 6 and N = 8^ GG without reparametrization stops because it hits the bound of a maximum 1500 
iterations, and therefore its ALL is smaller than the other two methods. 

5 Conclusions and future work 

We proposed Riemannian manifold optimization as a counterpart to the EM algorithm for fitting 
Gaussian mixture models. We demonstrated that for enabling manifold optimization to attain its true 
potential on GMMs, and to either match or outperform EM, it is necessary to represent the parameters 
in a different space and adjust the cost function accordingly. Extensive experimentation with both 
experimental and real datasets yielded quite encouraging results, suggesting that manifold optimization 
may hold the potential to open new algorithmic avenues for mixture modeling. 

Several strands of practical value are immediate from our work (and are a part of our ongoing 
efforts): (i) extension to large-scale mixtures (both large n and large K) through stochastic manifold 
optimization [6], especially given the importance of stochastic methods in the Euclidean setting; (ii) 
use of richer classes of priors with GMMs than the usual inverse Wishart priors (which are common, as 
they leave the M-step simple); this prior is actually geodesic convex and fits within the broader class 
of geodesic priors that our framework enables; (iii) incorporation of penalties for avoiding tiny clusters; 
such penalties fit in easily in our framework, though they are not as easy to use in the EM framework. 
Moreover, beyond just GMMs, exploration of other mixture models that can benefit from manifold 
optimization techniques is a fruitful topic worth exploring. 
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B LINE-SEARCH ALGORITHM 


A Technical details 

A.l Proof of Proposition 1 

First, we need the following lemma. 

Lemma 1. Let S, 0. Then, for a veetor x of appropriate dimension, 

< [x'^Sxf/^[x'^RxY/\ (A.l) 

Proof. Follows from [4, Thm. 4.1.3]. □ 

Proof. Proof (Prop. 1) Since (j) is continuous, it suffices to establish mid-point geodesic convexity: 

<^(7s.r(5)) < 14>{S) + \<f{R), for S,R& P". 

Denoting inessential constants by c, the above inequality turns into 

</>( 7 s , r ( 5 )) = ct>{S^/\S-^/^RS-^/^f/^R}/^) 

= -logdet(5i/2fii/2)-Pcy 

< -ilogdet(5)- \ \og<let{R) + c^fyjSyi\^/'^[yJRyf\^/‘^ 

< -ilogdet(S') + ^c'^^yfSy, - ilogdet(i^) \c^^yjRy^ 

= \f,{S) + \<f{R), 

where the first inequality follows from Lemma 1. □ 

A.2 Proof of Theorem 2.2 

Proof. Let S^,..., be a local maximum of C. Then, S* is the maximum of the following cost 
function: 

2 logdet(5'j) -I- - 2^.^^ w^y, Sj yi, 

where for each i G {1,..., n} the weight 

^ _ qAfjyxls*) 

T,f=iOijqN{y^\s*) 

Using an argument similar to that for Theorem 2.1, we see that s* = 1, whereby qj\f{yi\S*) = 
PJ\fixi]t*, U*). Thus, at a maximum the distributions agree and the proof is complete. □ 

B Line-search Algorithm 

Algorithm 2 summarizes a line-search algorithm satisfying strong Wolfe conditions. The zooming phase 
of the line-search is given in Algorithm B. Like in the Euclidean case ci is assumed to be a small 
number, here 10“^, and C 2 is a constant close to one, here 0.9. For interpolation and extrapolation one 
can find the minimum of a cubic polynomial approximation to the function in an interval. In each step 
of interpolation, the interpolation is done on an interval smaller that the actual interval to have specific 
distance from end-points of the interval (we used the distance to be 0.1 of the interval length). The 
interval for the extrapolation is assumed to be between 1.1 and 10 times larger than the point we are 
extrapolating from. For the cubic polynomial interpolation, we use the function (f>(.) and its gradient 
4>'i.) in the interval. For extrapolation, we use the function and gradient at 0 and at the end-point. 
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D SUPPLEMENTARY SIMULATED EXPERIMENTAL RESULTS 


Algorithm 2: Line-search satisfying Wolfe conditions Algorithm 3: ZoomingPhase 


1: Given: Current point Xk and descent direction 
2: <^(a) ^ f{Rx,{a^k)y, ^ 

3: oo 0, Or > 0 and i 0. 

4: while i < imax do 
5: i •<— i -I- 1 

6: if > <?^(0) -I-Ciai(^'(0) or 

* > 1 then 
7i Oiow — Oi—1 and Ohi — 

8: break 

9: else if \4>'(ai)\ < C2<;A'(0) then return Oi 

10: else if \(j)'{ai)\ > 0 then 

11: oiow = cti and Ohi = Oi-i 

12: break 

13: else 

14: Using extrapolation find Oi+i > ai 

15: end if 

16: end while 

17: Call ZoomingPhase 


1: while i < imax do 
2 : i ■(— i + 1 

3: Interpolate to find ai € (aiow,Q:hi) 

4: if (plai) > (^(0) -I- Ciaifpyo) or (piai) > (/>(aiow) 

then 

5: Ohi ^ ai 

6: else 

7: if \(j>'{ai)\ < C2</>'(0) then return at 

8: else if cp'{ai){aY,i — Oiow) > 0 then 

9. Ojii Olow 

10: end if 

11: Olow ^ ai 

12: end if 

13: end while 
14: return failure 


C Figure showing the effect of separation parameter 

A typical 2D data with K = 5 created for different separation is shown in Figure 3. 



(a) low separation (b) medium separation (c) high separation 


Figure 3: scatter data cloud for different degrees of separations. 


D Supplementary Simulated Experimental Results 

For the lower-dimensional cases and when the number of data is small, EM algorithm shows better 
performance than LBFGS optimization and pretty similar performance like CG. This is mainly because 
of the computational overhead like retraction and parallel transport that is needed to be computed for 
them. We believe that a more careful implementation will change the picture specially for the case 
of LBFGS. Because for performing parallel transport between Ei and E 2 , one can store the matrix 
and for performing the inner product at point E, it is possible to store the inverse of the 
matrix E; by storing these matrices the only computation remained is matrix product. 
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D.l Results for d = 20 


D SUPPLEMENTARY SIMULATED EXPERIMENTAL RESULTS 


We reported the result for d = 20 in the main text and because of the lack of space, we are reporting 
the result for usual CG below in Table 5. The results for low-dimensional cases d = 2 and d = 5 and 
for pretty small number of data-points n = d^ x 100 are shown in tables 6-11. 

D.l Results for d = 20 



e = 1 

Time (s) ALL 

e = 10 

Time (s) ALL 

II 

o 

II 

57.2 ± 49.7 17.6 

26.4 ± 29.2 -11.3 

K = 5 

225.3 ± 74.3 17.5 

216.7 ± 105.3 -12.9 

c=l K = 2 

43.1 ± 22.6 17.0 

27.5 ± 15.8 -10.8 

K = 5 

191.1 ± 86.8 16.2 

140.4 ± 39.7 -13.4 

c = 5 K = 2 

18.4 ±8.9 17.1 

36.4 ± 20.5 -11.2 

K = 5 

97.8 ± 48.1 16.1 

167.4 ± 90.5 -12.8 


Table 5: Speed and ALL for Usual CG and with d = 20. 


D.2 Results for d = 2 



EM Algorithm 
Time ALL 

LBFGS Reparametrized 
Time ALL 

GG Reparametrized 
Time ALL 

c= 0.2 

K = 

2 

0.4 ± 0.4 

0.6 

1.7 ± 1.0 

0.6 

0.6 ± 0.4 

0.6 


K = 

5 

1.4 ± 1.0 

-0.6 

7.3 ± 4.0 

-0.6 

2.1 ± 2.3 

-0.6 

c = 1 

K = 

2 

0.4 ± 0.3 

0.4 

1.4 ± 0.7 

0.4 

0.4 ± 0.2 

0.4 


K = 

5 

1.0 ± 1.0 

-1.3 

4.6 ± 2.7 

-1.3 

1.2 ± 0.8 

-1.3 

c= 5 

K = 

2 

0.0 ± 0.0 

0.2 

0.1 ± 0.1 

0.2 

0.1 ± 0.0 

0.2 


K = 

5 

0.1 ± 0.1 

-2.0 

2.0 ± 2.5 

-2.0 

0.4 ± 0.4 

-2.0 


Table 6: 

Speed and log-likelihood comparisons for d = 

2 and e = 10 




EM Algorithm 
Time ALL 

LBFGS Reparametrized 
Time ALL 

GG Reparametrized 
Time ALL 

II 

C<j 

o 

II 

0.7 ±0.6 1.8 

1.4 ± 0.8 1.8 

0.7 ± 0.4 1.8 

K = 5 

2.5 ± 1.8 1.8 

5.5 ± 1.7 1.8 

2.4 ± 0.9 1.8 

c=l K = 2 

0.7 ±0.5 1.6 

1.7 ±0.9 1.6 

0.8 ± 0.5 1.6 

K = 5 

2.1 ± 1.1 1.1 

5.1 ± 1.8 1.1 

2.8 ± 1.1 1.1 

c= 5 K = 2 

0.0 ± 0.1 1.1 

0.3 ± 0.3 1.1 

0.1 ± 0.1 1.1 

K = 5 

Table 7: 

0.3 ± 0.4 0.2 

Speed and log-like 

1.8 ± 1.3 0.2 

ihood comparisons for d = 

0.9 ± 0.6 0.2 

2 and e = 1 
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D.2 Results for d = 2 


D SUPPLEMENTARY SIMULATED EXPERIMENTAL RESULTS 



e = 1 

Time (s) ALL 

e = 10 

Time (s) ALL 

II 

o 

to 

II 

to 

1.0 ± 0.5 1.8 

0.9 ± 0.4 0.6 

K = 5 

5.3 ± 2.0 1.8 

6.7 ± 3.9 -0.6 

c=1 K = 2 

1.0 ± 0.4 1.6 

0.8 ± 0.4 0.4 

K = 5 

7.8 ± 4.8 1.1 

3.9 ±1.8 -1.3 

c = 5 K = 2 

0.7 ±0.7 1.1 

0.3 ±0.1 0.2 

K = 5 

4.8 ± 5.2 0.2 

3.2 ± 2.3 -2.0 


Table 8: Speed and ALL for Usual CG and with d = 2. 
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D.3 Results fcE d SUPPLEMENTARY EXPERIMENTAL RESULTS ON SOME REAL DATASETS 


D.3 Results for d = 5 



EM Algorithm 
Time ALL 

LBFGS Reparametrized 
Time ALL 

CG Reparametrized 
Time ALL 

c= 0.2 

K = 

2 

0.1 ± 0.0 

-1.4 

0.8 ± 0.6 

-1.4 

0.2 ± 0.1 

-1.4 


K = 

5 

3.1 ± 2.7 

-3.1 

14.0 ± 12.0 

-3.1 

3.6 ± 2.0 

-3.1 

c = 1 

K = 

2 

0.1 ± 0.0 

-0.7 

0.7 ± 1.0 

-0.7 

0.2 ± 0.2 

-0.7 


K = 

5 

0.7 ± 0.5 

-3.5 

7.0 ± 4.8 

-3.5 

1.7 ± 1.1 

-3.5 

c= 5 

K = 

2 

0.0 ± 0.0 

-1.1 

0.2 ± 0.2 

-1.1 

0.1 ± 0.1 

-1.1 


K = 

5 

0.9 ± 1.1 

-3.7 

4.9 ± 5.1 

-3.7 

1.4 ± 1.2 

-3.7 


Table 9: Speed and log-likelihood comparisons for d = 5 and e = 10 



EM Algorithm 
time ALL 

LBFGS Reparametrized 
time ALL 

GG Reparametrized 
time ALL 

c= 0.2 

K = 

2 

1.9 ± 2.0 

4.4 

3.6 ± 1.5 

4.4 

1.8 ± 1.1 

4.4 


K = 

5 

4.3 ± 1.7 

4.4 

13.9 ± 4.8 

4.4 

6.7 ± 2.4 

4.4 

c = 1 

K = 

2 

1.3 ± 1.2 

4.1 

2.1 ± 1.2 

4.0 

1.1 ± 0.7 

4.1 


K = 

5 

3.3 ± 2.2 

3.5 

9.5 ± 7.3 

3.5 

5.2 ± 2.1 

3.5 

c= 5 

K = 

2 

0.0 ± 0.0 

3.8 

0.2 ± 0.2 

3.8 

0.2 ± 0.1 

3.8 


K = 

5 

0.7 ± 1.5 

2.8 

3.3 ± 3.5 

2.8 

1.7 ± 1.9 

2.8 


Table 10: Speed and log-likelihood comparisons for d = 5 and e = 1 



e = 1 

Time (s) ALL 

e = 10 

Time (s) ALL 

c = 0.2 K = 2 

1.9 ± 0.8 4.4 

0.9 ± 0.2 -1.4 

K = 5 

12.2 ± 5.5 4.4 

10.6 ± 6.7 -3.1 

c=l K = 2 

3.2 ± 2.1 4.0 

1.3 ± 0.7 -0.7 

K = 5 

11.2 ± 4.9 3.5 

7.4 ± 3.0 -3.5 

c = 5 K = 2 

0.9 ± 0.4 3.8 

0.7 ±0.3 -1.1 

K = 5 

5.8 ± 3.5 2.8 

6.1 ± 6.0 -3.7 


Table 11: Speed and ALL for Usual CG and with d = 5. 


E Supplementary experimental results on some real datasets 

We selected some datasets from UCl machine learning dataset repository® and report the results for all 
of those we selected to perform the test on. As it can be seen from performance evaluations for these 
real datasets and also other simulated and real datasets in the main text, a systematic behavior for 
different optimization procedures can be observed. Namely by increasing the number of components, 
overlap increases leading to inferior performance of EM in compare to manifold optimization methods. 
For two of dataset, we normalized the features to have equal variance due to high variability of feature 
variances. 

® https:/ / archive .ics.uci.edu/ml/datasets 
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E.l Results foEMMISEI^MENm&bpKPERIMENTAL RESULTS ON SOME REAL DATASETS 


E.l Results for MAGIC gamma telescope 

In the case of MAGIC telescope dataset, the reparametrization proves to be extremely important, such 
that the stopping criterion of small cost difference is triggered without the algorithm being actually 
converged. 


EM Algorithm 
Time (s) ALL 

LBFGS Reparam 
Time (s) ALL 

CG Reparam 
Time (s) ALL 

CG Usual 

Time (s) ALL 

K = 

2 

0.28 

-28.44 

1.08 

-28.44 

0.54 

-28.44 

33.57 

-29.47 

K = 

3 

1.10 

-27.60 

4.12 

-27.56 

2.74 

-27.56 

127.98 

-29.14 

K = 

4 

3.59 

-27.29 

4.07 

-27.29 

2.14 

-27.29 

125.78 

-28.62 

K = 

5 

3.16 

-27.03 

7.40 

-27.03 

10.14 

-27.03 

222.54 

-28.75 

K = 

6 

10.16 

-26.90 

9.89 

-26.92 

8.86 

-26.92 

304.88 

-28.09 

K = 

7 

10.38 

-26.79 

11.02 

-26.87 

18.00 

-26.75 

395.92 

-27.99 

K = 

8 

9.01 

-26.64 

14.97 

-26.63 

16.04 

-26.64 

448.02 

-27.62 

K = 

9 

27.89 

-26.63 

18.74 

-26.66 

17.91 

-26.66 

505.05 

-27.66 

K = 

10 

16.16 

-26.47 

18.21 

-26.49 

22.23 

-26.49 

552.52 

-27.70 


Table 12: Speed and ALL comparisons for MAGIC gamma telescope data d = 10, n = 19020 


E.2 Results for (normalized) Corel image features 


EM Algorithm 
Time (s) ALL 

LBFGS Reparam 
Time (s) ALL 

CG Reparam 
Time (s) ALL 

K = 

2 

13.63 

-13.34 

18.18 

-13.34 

20.42 

-13.34 

K = 

3 

133.59 

-4.78 

164.52 

-4.78 

114.07 

-4.79 

K = 

4 

64.56 

0.26 

96.13 

0.26 

70.15 

0.26 

K = 

5 

178.76 

3.22 

110.39 

3.22 

91.87 

3.20 

K = 

6 

465.93 

4.53 

300.52 

5.25 

361.56 

5.24 

K = 

7 

646.85 

7.02 

347.00 

7.03 

712.65 

6.85 

K = 

8 

1124.44 

8.62 

442.05 

8.59 

557.63 

8.49 

K = 

9 

913.35 

9.84 

1163.63 

10.09 

981.04 

9.80 

K = 

10 

2213.15 

10.81 

592.88 

10.79 

1456.38 

10.79 


Table 13: Speed and ALL comparisons for Corel image features data d = 57, n = 68040 
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E.3 Results foEcoSOBREm^mi'^Am^fEiM’ERIMENTAL RESULTS ON SOME REAL DATASETS 


E.3 Results for combined cycle power plant 


EM Algorithm 
Time (s) ALL 

LBFGS Reparam 
Time (s) ALL 

CG Reparam 
Time (s) ALL 

K = 

2 

0.14 

-16.09 

0.31 

-16.09 

0.21 

-16.09 

K = 

3 

1.41 

-15.99 

3.99 

-15.98 

1.82 

-15.98 

K = 

4 

1.94 

-15.91 

4.56 

-15.91 

1.99 

-15.91 

K = 

5 

2.50 

-15.87 

3.40 

-15.88 

2.13 

-15.88 

K = 

6 

3.79 

-15.83 

7.56 

-15.82 

4.78 

-15.82 

K = 

7 

9.18 

-15.81 

7.39 

-15.80 

3.58 

-15.80 

K = 

8 

12.44 

-15.78 

17.04 

-15.74 

9.32 

-15.74 

K = 

9 

11.41 

-15.76 

17.39 

-15.76 

36.41 

-15.76 

K = 

10 

73.27 

-15.69 

52.41 

-15.69 

23.06 

-15.69 


Table 14: Speed and ALL comparisons for power plant data d = 5, n = 2568 


E.4 Results for (normalized) YearPredictionMSD 


EM Aigorithm 
Time (s) ALL 

LBFGS Reparam 
Time (s) ALL 

CG Reparam 
Time (s) ALL 

K = 

2 

248.14 

-86.67 

224.73 

-86.67 

196.11 

-86.67 

K = 

3 

352.74 

-82.00 

549.42 

-82.00 

752.15 

-82.00 

K = 

4 

816.22 

-79.79 

1212.66 

-79.79 

1832.93 

-79.79 

K = 

5 

5152.93 

-78.13 

5959.86 

-78.13 

3061.53 

-80.02 

K = 

6 

2921.52 

-76.96 

1415.24 

-76.96 

3084.32 

-76.96 

K = 

7 

4717.05 

-76.09 

4690.40 

-76.09 

5813.55 

-76.09 

K = 

8 

5528.35 

-75.32 

3466.55 

-75.32 

4518.16 

-75.32 

K = 

9 

10729.09 

-74.76 

5015.60 

-74.76 

8703.81 

-74.76 


Table 15: Speed and ALL comparisons for YearPredictionMSD data d = 90, n = 515345 
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F PSEUCODE FOR RIEMANNIAN LBFGS 


F Pseucode for Riemannian LBFGS 


Algorithm 4: L-RBFGS 


Given: Riemannian manifold A4 with Riemannian metric g; parallel transport T on Af; geodesics 
R; initial value Xq; a smooth function / 

Set initial i?diag = ^iVsxo (grad/ (Aq ), grad/ (Aq )) 

for A: = 0, Ij • ■ • do 

Obtain descent direction by unrolling the RBFGS method 
^ HESSMuL(-grad/(A:fc),A:) 

Use line-search to find a such that it satisfies Wolfe conditions 


Calculate X^+i = Rxk(o:^k) 

Define Sk = Txk,Xk+i(o:^k) 

Define Yfc = grad/(Afc+i) - (grad/(Afc)) 

Update -Afdiag — 9 x ^+1 D/g) 

Store Yfc; Sk', gXk+ii^kjYk)', gXk+ii^k, Sk)/gXk+ii^kjYk)', Hd'mg 

end for 
return Xk 

function HessMul(P, k) 

if fc > 0 then 


gx^^i{Sk,Pk+i) 

9Xk + i <Xk,Sk) 


Yk 


P = 7xfc+i,XfcHESSMuL(7xfc.Xfc+if’fc,A: - 1) return P 


gxk^AYkp) 

gxk+i (Yk,Sk) 


else 

return iJdiagD 
end if 

end function 


Sk + 


+ l (^k,Sk) p 
9x^+1 
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